Skip to content

fix(sandbox): refresh docker/podman/vm tokens in gateway#1639

Open
TaylorMutch wants to merge 2 commits into
mainfrom
1603-gateway-managed-token-refresh/tm
Open

fix(sandbox): refresh docker/podman/vm tokens in gateway#1639
TaylorMutch wants to merge 2 commits into
mainfrom
1603-gateway-managed-token-refresh/tm

Conversation

@TaylorMutch
Copy link
Copy Markdown
Collaborator

Summary

Fix sandbox token recovery for singleplayer runtimes by making Docker, Podman, and VM token refresh gateway-managed, with an explicit supervisor auth mode contract for each backend. Kubernetes remains on ServiceAccount exchange instead of gateway-managed token files.

Related Issue

Closes #1603

Changes

  • Add explicit OPENSHELL_SANDBOX_AUTH_MODE values for static, gateway-managed file, gateway-managed supervisor-push, and Kubernetes ServiceAccount exchange paths
  • Refresh Docker, Podman, and VM sandbox tokens from the gateway during startup resume and periodic rotation
  • Add compute-driver resume/write-token RPCs and VM supervisor token push handling
  • Add Docker, Podman, and VM expired-token resume regression tests
  • Update architecture and driver docs for the explicit token ownership model

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated
  • Docker expired-token resume e2e passed
  • Podman expired-token resume e2e passed
  • VM expired-token resume e2e: attempted locally, blocked by missing mkfs.ext4/e2fsprogs prerequisite on this machine

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 29, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for df3f0bc. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch TaylorMutch force-pushed the 1603-gateway-managed-token-refresh/tm branch from df3f0bc to 7afe767 Compare June 1, 2026 16:08
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@maxamillion
Copy link
Copy Markdown
Collaborator

maxamillion commented Jun 2, 2026

@TaylorMutch In the process of asking my coding agent questions to wrap my head around some of the implementation details here, I stumbled on the following potential issue where a silent failure can occur in the VM token rotation:

In crates/openshell-server/src/compute/mod.rs:887, the refresh sweep treats write_sandbox_token() success as a completed refresh, then optionally pushes the token over the supervisor stream. But send_sandbox_token_update() returns Ok(false) when no supervisor session is connected (crates/openshell-server/src/supervisor_session.rs:210), and the caller ignores that boolean. For VM, the driver-side write only updates the persisted sandbox request (crates/openshell-driver-vm/src/driver.rs:589); the running guest updates /opt/openshell/auth/sandbox.jwt only when it receives SandboxTokenUpdate (crates/openshell-sandbox/src/supervisor_session.rs:422). If a running VM’s supervisor stream is temporarily disconnected during refresh, the sweep records success while the guest keeps the old token. Once that old JWT expires, reconnects and outbound RPCs can keep failing because no connected session exists to receive the new token. Fix by treating Ok(false) as an undelivered refresh for supervisor-push runtimes and retrying, or by sending the latest minted token during VM supervisor session registration.

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

@TaylorMutch In the process of asking my coding agent questions to wrap my head around some of the implementation details here, I stumbled on the following potential issue where a silent failure can occur in the VM token rotation:

In crates/openshell-server/src/compute/mod.rs:887 (https://github.com/NVIDIA/OpenShell/pull/1639/files), the refresh sweep treats write_sandbox_token() success as a completed refresh, then optionally pushes the token over the supervisor stream. But send_sandbox_token_update() returns Ok(false) when no supervisor session is connected (crates/openshell-server/src/supervisor_session.rs:210), and the caller ignores that boolean. For VM, the driver-side write only updates the persisted sandbox request (crates/openshell-driver-vm/src/driver.rs:589); the running guest updates /opt/openshell/auth/sandbox.jwt only when it receives SandboxTokenUpdate (crates/openshell-sandbox/src/supervisor_session.rs:422). If a running VM’s supervisor stream is temporarily disconnected during refresh, the sweep records success while the guest keeps the old token. Once that old JWT expires, reconnects and outbound RPCs can keep failing because no connected session exists to receive the new token. Fix by treating Ok(false) as an undelivered refresh for supervisor-push runtimes and retrying, or by sending the latest minted token during VM supervisor session registration.

Yup, this is exactly where I've gotten hung up on this PR and haven't made much progress since I first posted it. The VM implementation in particular has been tricky, and I am definitely open to ideas here.

@maxamillion
Copy link
Copy Markdown
Collaborator

@TaylorMutch I think the main question I have is, is the expectation that the microvm ComputeDriver be a single-host single-player ComputeDriver option similar to podman and docker?

If so, we should be able to make certain assumptions that can make this easier to reason with. Effectively my pitch would be to toss the entire gateway managed file approach and instead use krun_add_virtiofs3 to expose the host OS directory to the VM as a read-only virtio-fs device. Then the refresh logic could be the same as it is for docker and podman. The is similar in approach to how podman --runtime krun volume bind mounts work, though I think what we would need here could be more simple and narrow in scope since we don't need to deal with the OCI or filesystem pre-prep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(auth): local-driver sandboxes cannot restart after on-disk bootstrap JWT expires

2 participants